
feature: PoC to mknod devices for user namespace containers#5138

Draft
everzakov wants to merge 1 commit into opencontainers:main from everzakov:userns-mknod


@everzakov everzakov commented Mar 3, 2026

This pull request is a PoC that calls mknod to create device nodes for containers that use a user namespace. It removes the limitation that such devices must already exist before the container is created.
The main idea is to call mknod from the initial user namespace while inside the container's mount namespace. runc must also chown each created device so that it has the right uid/gid inside the user namespace.

The main kernel restrictions that shape this solution:

  1. The mknod capability (CAP_MKNOD) is checked only against the initial user namespace: https://elixir.bootlin.com/linux/v6.17.9/source/fs/namei.c#L4228
  2. The /dev mount must also be created in the initial user namespace: https://elixir.bootlin.com/linux/v6.17.9/source/fs/super.c#L358 . Otherwise its superblock gets SB_I_NODEV and devices on it cannot be opened: https://elixir.bootlin.com/linux/v6.17.9/source/fs/namei.c#L3441

To work around both restrictions we can reuse an existing mechanism, the goCreateMountSources function: https://github.com/opencontainers/runc/blob/v1.4.0/libcontainer/process_linux.go#L604
It will call mknod and mount the tmpfs from the initial user namespace while in the container's mount namespace, and then chown both to the right uid/gid.
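
A minimal sketch of the mknod-then-chown step described above, using only Go's standard `syscall` package on Linux. The helper names (`mkdev`, `createCharDevice`) and the temporary path are hypothetical and are not taken from the PR. The sketch works on plain paths, so it does not address safe path resolution inside a container rootfs, and the mknod call is expected to fail without CAP_MKNOD in the initial user namespace.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// mkdev packs a major/minor pair into the dev_t layout used by Linux
// userspace (the same layout as glibc's makedev). Hypothetical helper.
func mkdev(major, minor uint32) uint64 {
	dev := (uint64(major) & 0x00000fff) << 8
	dev |= (uint64(major) & 0xfffff000) << 32
	dev |= uint64(minor) & 0x000000ff
	dev |= (uint64(minor) & 0xffffff00) << 12
	return dev
}

// createCharDevice creates a character device node at path and then changes
// its owner to the given host uid/gid. The final permission bits are still
// subject to the process umask.
func createCharDevice(path string, perm, major, minor uint32, uid, gid int) error {
	mode := uint32(syscall.S_IFCHR) | (perm & 0o777)
	if err := syscall.Mknod(path, mode, int(mkdev(major, minor))); err != nil {
		return fmt.Errorf("mknod %s: %w", path, err)
	}
	if err := syscall.Lchown(path, uid, gid); err != nil {
		return fmt.Errorf("chown %s: %w", path, err)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "userns-mknod-sketch")
	if err != nil {
		fmt.Println("cannot create temp dir:", err)
		return
	}
	defer os.RemoveAll(dir)

	// 1:3 is the null device; 100000/200000 are the host IDs that appear
	// in the mountinfo listings of this PR description.
	target := filepath.Join(dir, "null")
	if err := createCharDevice(target, 0o666, 1, 3, 100000, 200000); err != nil {
		fmt.Println("device not created (likely missing privileges):", err)
		return
	}
	fmt.Println("created", target)
}
```

The order matters in this sketch: the node is created first with the privileges of the initial user namespace, and only afterwards is ownership handed to the IDs that the container maps.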

The main change for CRIU is that such devices are no longer listed in /proc/$pid/mountinfo (because they are created with mknod instead of being bind-mounted).
I also currently hit a problem ("Operation not permitted" on remount) when restoring masked paths such as /proc/kcore, which is why /dev/null is still bind-mounted.

/dev before:

total 4
drwxr-xr-x    5 root     root           360 Mar  3 10:41 .
drwxr-xr-x   13 root     root          4096 Mar  3 10:41 ..
crw--w----    1 root     tty       136,   0 Mar  3 10:41 console
lrwxrwxrwx    1 root     root            11 Mar  3 10:41 core -> /proc/kcore
lrwxrwxrwx    1 root     root            13 Mar  3 10:41 fd -> /proc/self/fd
crw-rw-rw-    1 nobody   nobody      1,   7 Mar  3 09:51 full
drwxrwxrwt    2 root     nobody          40 Mar  3 10:41 mqueue
crw-rw-rw-    1 nobody   nobody      1,   3 Mar  3 09:51 null
lrwxrwxrwx    1 root     root             8 Mar  3 10:41 ptmx -> pts/ptmx
drwxr-xr-x    2 root     root             0 Mar  3 10:41 pts
crw-rw-rw-    1 nobody   nobody      1,   8 Mar  3 09:51 random
drwxrwxrwt    2 root     root            40 Mar  3 10:41 shm
lrwxrwxrwx    1 root     root            15 Mar  3 10:41 stderr -> /proc/self/fd/2
lrwxrwxrwx    1 root     root            15 Mar  3 10:41 stdin -> /proc/self/fd/0
lrwxrwxrwx    1 root     root            15 Mar  3 10:41 stdout -> /proc/self/fd/1
crw-rw-rw-    1 nobody   nobody      5,   0 Mar  3 10:41 tty
crw-rw-rw-    1 nobody   nobody      1,   9 Mar  3 09:51 urandom
crw-rw-rw-    1 nobody   nobody      1,   5 Mar  3 09:51 zero

/dev after:

total 4
drwxr-xr-x    5 root     root           380 Mar  3 10:31 .
drwxr-xr-x   13 root     root          4096 Mar  3 10:31 ..
crw-rw----    1 root     root        1,   9 Mar  3 10:31 another
crw--w----    1 root     tty       136,   0 Mar  3 10:31 console
lrwxrwxrwx    1 root     root            11 Mar  3 10:31 core -> /proc/kcore
lrwxrwxrwx    1 root     root            13 Mar  3 10:31 fd -> /proc/self/fd
crw-rw-rw-    1 root     root        1,   7 Mar  3 10:31 full
drwxrwxrwt    2 root     nobody          40 Mar  3 10:31 mqueue
crw-rw-rw-    1 nobody   nobody      1,   3 Mar  3 09:51 null
lrwxrwxrwx    1 root     root             8 Mar  3 10:31 ptmx -> pts/ptmx
drwxr-xr-x    2 root     root             0 Mar  3 10:31 pts
crw-rw-rw-    1 root     root        1,   8 Mar  3 10:31 random
drwxrwxrwt    2 root     root            40 Mar  3 10:31 shm
lrwxrwxrwx    1 root     root            15 Mar  3 10:31 stderr -> /proc/self/fd/2
lrwxrwxrwx    1 root     root            15 Mar  3 10:31 stdin -> /proc/self/fd/0
lrwxrwxrwx    1 root     root            15 Mar  3 10:31 stdout -> /proc/self/fd/1
crw-rw-rw-    1 root     root        5,   0 Mar  3 10:32 tty
crw-rw-rw-    1 root     root        1,   9 Mar  3 10:31 urandom
crw-rw-rw-    1 root     root        1,   5 Mar  3 10:31 zero

mountinfo before:

573 406 252:0 /tmp/bats-run-lRvmOr/runc.9GZDZi/bundle/rootfs / ro,relatime - ext4 /dev/mapper/ubuntu--vg-ubuntu--lv rw
575 573 0:61 / /proc rw,relatime - proc proc rw
576 573 0:62 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,uid=100000,gid=200000,inode64
577 576 0:63 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=200005,mode=620,ptmxmode=666
578 576 0:64 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=65536k,uid=100000,gid=200000,inode64
579 576 0:59 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
580 573 0:65 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs ro
581 580 0:29 /user.slice/user-1000.slice/target_userns /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup rw
582 576 0:5 /null /dev/null rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
583 576 0:5 /random /dev/random rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
584 576 0:5 /full /dev/full rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
585 576 0:5 /tty /dev/tty rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
586 576 0:5 /zero /dev/zero rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
587 576 0:5 /urandom /dev/urandom rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
407 576 0:63 /0 /dev/console rw,nosuid,noexec,relatime - devpts devpts rw,gid=200005,mode=620,ptmxmode=666
408 575 0:61 /bus /proc/bus ro,relatime - proc proc rw
409 575 0:61 /fs /proc/fs ro,relatime - proc proc rw
410 575 0:61 /irq /proc/irq ro,relatime - proc proc rw
411 575 0:61 /sys /proc/sys ro,relatime - proc proc rw
412 575 0:61 /sysrq-trigger /proc/sysrq-trigger ro,relatime - proc proc rw
526 575 0:66 / /proc/acpi ro,relatime - tmpfs tmpfs ro,uid=100000,gid=200000,inode64
527 575 0:67 / /proc/asound ro,relatime - tmpfs tmpfs ro,uid=100000,gid=200000,inode64
528 575 0:5 /null /proc/kcore rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
529 575 0:5 /null /proc/keys rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
530 575 0:5 /null /proc/latency_stats rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
531 575 0:5 /null /proc/timer_list rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
532 580 0:68 / /sys/firmware ro,relatime - tmpfs tmpfs ro,uid=100000,gid=200000,inode64
533 575 0:69 / /proc/scsi ro,relatime - tmpfs tmpfs ro,uid=100000,gid=200000,inode64

mountinfo after:

573 406 252:0 /tmp/bats-run-g9OAEQ/runc.jlHm5o/bundle/rootfs / ro,relatime - ext4 /dev/mapper/ubuntu--vg-ubuntu--lv rw
575 573 0:61 / /proc rw,relatime - proc proc rw
576 573 0:62 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,inode64
577 576 0:63 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=200005,mode=620,ptmxmode=666
578 576 0:64 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=65536k,uid=100000,gid=200000,inode64
579 576 0:59 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
580 573 0:65 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs ro
581 580 0:29 /user.slice/user-1000.slice/test_busybox /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup rw
582 576 0:5 /null /dev/null rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
407 576 0:63 /0 /dev/console rw,nosuid,noexec,relatime - devpts devpts rw,gid=200005,mode=620,ptmxmode=666
408 575 0:61 /bus /proc/bus ro,relatime - proc proc rw
409 575 0:61 /fs /proc/fs ro,relatime - proc proc rw
410 575 0:61 /irq /proc/irq ro,relatime - proc proc rw
411 575 0:61 /sys /proc/sys ro,relatime - proc proc rw
412 575 0:61 /sysrq-trigger /proc/sysrq-trigger ro,relatime - proc proc rw
526 575 0:66 / /proc/acpi ro,relatime - tmpfs tmpfs ro,uid=100000,gid=200000,inode64
527 575 0:67 / /proc/asound ro,relatime - tmpfs tmpfs ro,uid=100000,gid=200000,inode64
528 575 0:5 /null /proc/kcore rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
529 575 0:5 /null /proc/keys rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
530 575 0:5 /null /proc/latency_stats rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
531 575 0:5 /null /proc/timer_list rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64
532 580 0:68 / /sys/firmware ro,relatime - tmpfs tmpfs ro,uid=100000,gid=200000,inode64
533 575 0:69 / /proc/scsi ro,relatime - tmpfs tmpfs ro,uid=100000,gid=200000,inode64
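
The difference between the two mountinfo listings can be checked mechanically. The following is a rough sketch, assuming the standard /proc/$pid/mountinfo field layout; the function name `deviceBindMounts` is hypothetical and is not part of runc or CRIU. It lists mount points under /dev whose filesystem type is devtmpfs, which is how the bind-mounted host devices show up above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// deviceBindMounts returns the mount points under /dev whose filesystem type
// is devtmpfs, i.e. device nodes that were bind-mounted from the host's /dev.
func deviceBindMounts(mountinfo string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Layout: id parent major:minor root mountpoint options
		// [optional fields...] - fstype source superoptions
		sep := -1
		for i := 6; i < len(fields); i++ {
			if fields[i] == "-" {
				sep = i
				break
			}
		}
		if sep < 0 || sep+1 >= len(fields) {
			continue // not a well-formed mountinfo line
		}
		mountPoint, fsType := fields[4], fields[sep+1]
		if fsType == "devtmpfs" && strings.HasPrefix(mountPoint, "/dev/") {
			out = append(out, mountPoint)
		}
	}
	return out
}

func main() {
	// Two lines copied from the "after" listing above.
	sample := "582 576 0:5 /null /dev/null rw,nosuid,relatime master:2 - devtmpfs udev rw,size=4785984k,nr_inodes=1196496,mode=755,inode64\n" +
		"407 576 0:63 /0 /dev/console rw,nosuid,noexec,relatime - devpts devpts rw,gid=200005,mode=620,ptmxmode=666\n"
	fmt.Println(deviceBindMounts(sample))
}
```

Applied to the full listings, this filter would report six device bind mounts before the change and only /dev/null after it; the masked /proc entries are excluded because their mount points are not under /dev.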

The current PoC limitations all concern CRIU:

  1. Device major/minor numbers must be the same at checkpoint and at restore.
  2. The user namespace mapping (host ID, container ID, length) must be the same at checkpoint and at restore.
  3. Only simple mount scenarios have been checked (for example, it is untested what happens if the user mounts the container's /dev at some other path).

P.S. In my environment the dump/restore phases sometimes get stuck, although the restore log suggests everything is fine. This may be a misconfiguration on my side :=(.

CRIU changes: link.

Closes: #5137

Signed-off-by: Efim Verzakov <efimverzakov@gmail.com>
@cyphar (Member) left a comment


I don't think this is the right approach at all (see my comment in the issue you opened), but here are the most obvious issues I spotted. I'm really not a fan of this idea.

Comment on lines +352 to +365
destPath, err := securejoin.SecureJoin(rootfs, node.Path)
if err != nil {
	return nil, err
}

err = createDeviceNode(rootfs, node, false)
if err != nil {
	return nil, err
}

mountFile, err := os.OpenFile(destPath, unix.O_PATH|unix.O_CLOEXEC, 0)
if err != nil {
	return nil, err
}

This is all insecure. You need to use mknodat with OpenInRoot and then, after doing the re-open (again with OpenInRoot), you need to do VerifyInode. libpathrs has helpers for the first bit.

Comment on lines 1119 to 1128
for _, node := range config.Devices {
	node.Uid = uint32(rootUID)
	node.Gid = uint32(rootGID)
	hostUID, err := config.HostUID(int(node.Uid))
	if err == nil {
		node.Uid = uint32(hostUID)
	}
	hostGID, err := config.HostGID(int(node.Gid))
	if err == nil {
		node.Gid = uint32(hostGID)
	}
}

Not sure about this in general, but in the err != nil case you need to use rootUID/rootGID otherwise the devices will be unmapped.
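
One possible reading of this comment, sketched as standalone Go: translate the requested owner through the container's ID mapping and fall back to the host ID of container root when the requested ID is not mapped. The `idMap` type and the `hostID` and `deviceOwner` functions are hypothetical and only mirror the idea; they are not runc's config API.

```go
package main

import "fmt"

// idMap mirrors one line of a uid_map/gid_map: a contiguous range of
// container IDs mapped onto host IDs.
type idMap struct {
	ContainerID int64
	HostID      int64
	Size        int64
}

// hostID translates a container ID to a host ID. The second result is false
// when no mapping covers the ID.
func hostID(containerID int64, maps []idMap) (int64, bool) {
	for _, m := range maps {
		if containerID >= m.ContainerID && containerID < m.ContainerID+m.Size {
			return m.HostID + (containerID - m.ContainerID), true
		}
	}
	return 0, false
}

// deviceOwner picks the host ID to chown a device node to. If the requested
// container ID has no mapping, it falls back to hostRoot (the host ID of
// container root) so that the node does not end up unmapped in the container.
func deviceOwner(requested int64, maps []idMap, hostRoot int64) int64 {
	if id, ok := hostID(requested, maps); ok {
		return id
	}
	return hostRoot
}

func main() {
	// Base host IDs taken from the mountinfo listings in this PR
	// (uid=100000, gid=200000); the range size is an assumption.
	uidMaps := []idMap{{ContainerID: 0, HostID: 100000, Size: 65536}}
	gidMaps := []idMap{{ContainerID: 0, HostID: 200000, Size: 65536}}

	fmt.Println(deviceOwner(0, uidMaps, 100000))
	fmt.Println(deviceOwner(5, gidMaps, 200000))
	fmt.Println(deviceOwner(70000, uidMaps, 100000))
}
```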

Comment on lines +736 to +746
if m.Device == "usernsMknod" {
	// Create device in initial user ns
	for _, device := range p.config.Config.Devices {
		if device.Path == m.Source {
			src, err = usernsMknod(p.config.Config.Rootfs, device)
			break
		}
	}
	if src == nil && err == nil {
		err = fmt.Errorf("can not find device node")
	}

A magical mount type really feels like the wrong way of doing this. It would also need a runtime-spec change.

Comment on lines +750 to +771
// Mount tmpfs in initial user ns and chown it
entry := mountEntry{Mount: m}
mountConfig := &mountConfig{
	root: p.config.Config.Rootfs,
	label: p.config.Config.MountLabel,
	rootlessCgroups: p.config.Config.RootlessCgroups,
	cgroupns: p.config.Config.Namespaces.Contains(configs.NEWCGROUP),
}
err := mountToRootfs(mountConfig, entry)
if err == nil {
	destPath, _ := securejoin.SecureJoin(mountConfig.root, m.Destination)
	mountFile, err := os.OpenFile(destPath, unix.O_PATH|unix.O_CLOEXEC, 0)
	uid, _ := p.config.Config.HostRootUID()
	gid, _ := p.config.Config.HostRootGID()
	err = sys.FchownFile(mountFile, uid, gid)
	if err == nil {
		src = &mountSource{
			file: mountFile,
			Type: mountSourcePlain,
		}
	}
}

This should be done with fsopen, this is all very wrong and pollutes the host mount table AFAICS?

Comment on lines +760 to +761
destPath, _ := securejoin.SecureJoin(mountConfig.root, m.Destination)
mountFile, err := os.OpenFile(destPath, unix.O_PATH|unix.O_CLOEXEC, 0)

This is also insecure for the same reason as above, this should've been OpenInRoot. (But I think this code shouldn't exist.)
